knitr::opts_chunk$set(warning=FALSE, message=FALSE)

1 Welcome

data gathered from https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases

2 Cleaning data

Here I am doing a few things:

What I need to do:

Import data of testing
options(stringsAsFactors=FALSE)
library(plyr)
library(dplyr)
library(reshape2)
library(ggplot2)
library(plotly)
library(highcharter)


files <- list.files("raw", full.names = TRUE)

#function for grabbing status from my filenames
getStatus<- function(x){
  strsplit(x, "-")[[1]] %>%
    last() %>%
    gsub(pattern="\\.csv", replacement="")
}

#function created for adding an active status
createActive <- function(x){
  dcast(Country.Region + Date ~ Status,
        data=x, value.var="Value") %>%
    mutate(Active = Confirmed - (Deaths + Recovered)) %>% 
    melt(id = c("Country.Region", "Date"))
}

raw <- files %>% #read in files
  lapply(function(x){
  read.csv(x) %>% 
    mutate(
      Date = as.Date(Date, "%m/%d/%Y"),
      Status = getStatus(x) #add status
    )
}) %>% 
  bind_rows() %>% #combine files
  subset(!(Value==0) )

raw <- raw %>% #combine countries into one
  group_by(Country.Region, Date, Status) %>% 
  summarise(
    Value = sum(Value)
  )

raw <- raw %>% #get global stats
  group_by(Date, Status) %>% 
  summarise(
    Value = sum(Value)
  ) %>% 
  ungroup() %>% 
  mutate(
    Country.Region = "Global"
  ) %>% 
  bind_rows(raw)

raw <- raw %>% #create active columns, delete nulls, rename
  createActive() %>% 
  subset(!is.na(value)) %>% 
  rename(
   Country = Country.Region,
    Value = value,
    Status = variable
  )

total <- raw %>% #Get total values for seperate df
  group_by(Country, Status) %>% 
  summarise(
    Value = max(Value)
  ) %>% 
  ungroup()

#get change per day
raw <- raw %>%
  group_by(Country, Status) %>% 
  mutate(
    Change =  Value - lag(Value, k=1),
    Rate_Change =  (Value - lag(Value, k=1))/lag(Value, k=1) 
    )

#my status' were being read in as factors
raw$Status <- as.character(raw$Status)
total$Status <- as.character(total$Status)

3 Analysis

Now that the data is clean, lets start by creating a few plots!

The above graph shows the variance of cases by type, and by country. We can see some pretty interesting things here. For example, Italy has the highest deaths and active cases, yet they are only second in the number of confirmed cases. This could mean two things:

4 Rates of Change

Theres a couple things to keep in mind when looking at the rate of change.

\(f(x) = P_0(1+r)^d\)

The rate is what gets exponentiated for an exponential function, ie, if our mean rate is higher, our confirmed cases will get exponentially higher with each day. Also, if the rate is negative that means we are starting to lose cases per day.

This graph clearly shows that the deaths rate of change has not only a higher average, but most of the rates are higher than the rate of confirmed. This tells us the number of deaths are going up much faster than our confirmed cases are.

This could be due to the fact that the Confirmed cases will always be lagging behind, or it could be due to the beginning of an overburdened healthcare system. It is also a possiblility that since our total Deaths are still low, we don’t have enough data to get a good rate, and it is still in the beginning phase, where the rates tend to be higher.

5 Deeper look at Italys’ Death Rate of Change

Looking at this scatter plot, the number of Deceased cases rise, the Rate of Change trends downwards. This means that Italy has a negative correlation between the actual change in cases per day and the rate of change. This is a good thing, as it could mean a few things. Lets find out how well correlated the two actually are.

There are quite a few outliers in our data, so I will get the outliers out of our data

Before we start, lets check to see if our Rate of Change is relatively normal.

## 
##  Shapiro-Wilk normality test
## 
## data:  std_corr$Rate_Change
## W = 0.94448, p-value = 0.1314

This tells us that our p value is 0.1314 > 0.05

This is great news! It means our data is normal, which makes the evaluation easier.

As expected there is a negative correlation between the Rate of Change and Total cases, aswell as a negative correlation between our Rate of Change and the cases per day. This makes sense as the amount of cases increase, our rate actually starts decreasing. This is great news, as it means our rate is trending downwards, ultimately suggesting that the amount of deaths per day is slowing down.

The r squared value can tell us much more, for example, we know that 88.7% of our Cases per day can be explained by our Total cases! This also tells us that only 29.4% of our Cumulative cases can be explained by our Rate of Change.

Let’s check if these values are statistically significant, to do that we need the below formula to get our test statistic.

\({ts} = \frac{r*\sqrt[2]{n-2}}{\sqrt[2]{1-r^2}}\)

##                 Value    Change Rate_Change
## Value             Inf 14.245784    3.355033
## Change      14.245784       Inf    2.206968
## Rate_Change  3.355033  2.206968         Inf

These are our test statistics for each r value, now we need to use the below formula to check if the r value is statistically significant.

\({-ts} > r > ts\)

Now, lets test to see if these values are significant, or not. Here is a matrix of our r values.

##                  Value     Change Rate_Change
## Value        1.0000000  0.9394569  -0.5424324
## Change       0.9394569  1.0000000  -0.3909310
## Rate_Change -0.5424324 -0.3909310   1.0000000

This tells us that our all our correlations are statistically significant! Which means that we can use all of our columns for a predictor of the other columns!

6 Building a model of Italys’ total Deaths

I want to first build a model of Italys’ cumulative deaths, this can further help us find the actual number of confirmed cases!

There are a few things I want to note:

So lets get started!

First, lets fix the data to show us exactly what we need.

6.1 Getting an accurate mortality rate

What I did: * Grabbed confirmed cases and Deceased Cases for each date * Calculated the Deaths per day, aswell as that rate of change * Calculated the rough Mortality rate with the formula below

$ {Mortality Rate} = $

Meaning, if we divide our cumulative Deaths by our cumulative Confirmed cases from 12 days prior. We should get a better estimate to our Mortality Rate. We can use this to estimate the actual cases per day, but first lets look at these Mortality Rates and get our average.

The above graph shows the distribution of the mortality rates, Visually this distribution does not appear to be normal, however lets test that using the Shapiro-Wilk test.

## 
##  Shapiro-Wilk normality test
## 
## data:  data$Mortality_Rate
## W = 0.90978, p-value = 0.01109
## The mean Mortality Rate is:  0.05286329

Since our p value < 0.05, we can safely say that our mortality Rate is not a normal distribution.

Despite this, we can estimate the amount of cases per day, however I will build models for Italys’ mean mortality rate, Italys’ changing mortality rate, and the global mortality rate.

This is our Global distribution of our Mortality Rates, per day. Again, this distribution is not normal, however lets do a Shapiro-Wilk test to be safe.

## 
##  Shapiro-Wilk normality test
## 
## data:  data$Mortality_Rate
## W = 0.93786, p-value = 0.003623
## The mean Mortality Rate is:  0.03036331

As expected, this p value is < 0.05, meaning the data is not distributed normally. However we do have a mean Mortality Rate of which we can use to build our model, along with our other options.

There are two things I want to point out about Covid-19:

  • We know the median incubation period is 4 to 5 days.
    • This means that there is most likely a 5 day lag in our confirmed cases, as 50% of infected individuals will not experience any symptoms before the incubation period.
  • We know the average days from contraction to death is 17 days.
    • This can help us estimate the amount of infections 17 days prior.

sources: https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19?source=history_widget

6.3 Rough model

Great, we got most of our days filled it, however we are missing 17 values of the closest dates. This is because we used the deaths from 17 back. We also have some unknown values in the beginning due to having no deaths, however those values wont matter too much for as it wont effect our totals too much.

##          Date Infections_PD_mu Infections_PD_gl Infections_PD_var
## 1  2020-01-31               NA               NA                NA
## 2  2020-02-01               NA               NA                NA
## 3  2020-02-02               NA               NA                NA
## 4  2020-02-03               NA               NA                NA
## 5  2020-02-04               NA               NA                NA
## 6  2020-02-05               19               33                31
## 7  2020-02-06               19               33                52
## 8  2020-02-07               76              132               131
## 9  2020-02-08               57               99                97
## 10 2020-02-09               38               66                76

As shown, we now have our per day Infections!

  • Infections_PD_mu was grabbed using Italys’ static mean Mortality Rate
  • Infections_PD_gl was grabbed using the Global static mean Mortality Rate
  • Infections_PD_var was grabbed using Italys’ Mortality rate per day

As we can see, the estimated infections that ended up being the lowest ended up being Italys’ static mean calculation!

6.4 Filling in the blanks

Now that we have a start, we can use our Deaths Rate of Change as a predictor for the rest of our model! As shown earlier, the Deaths Rate of Change is statistically significant to be a predictor of our total Deaths.